Visual Introduction to Maximum Likelihood Estimation

Jason Bryer, Ph.D.

2021-01-31

Maximum Likelihood Estimation (MLE) is an important procedure for estimating parameters in statistical models. It is often first encountered when modeling a dichotomous outcome variable with logistic regression. However, it is the backbone of generalized linear models (GLM), which allow for error distribution models other than the normal distribution. This document was inspired by a blog post by Jorge Cimentada and provides some additional details about each step of MLE.

We will begin with a typical bivariate regression using the mtcars data set where we wish to predict mpg (miles per gallon) from wt (weight in 1,000 lbs). Figure 1 is a scatter plot showing the relationship between these two variables.

Figure 1. Scatter plot of weight versus miles per gallon.

Our goal is to estimate

\[Y_{mpg} = \beta_0 + \beta_{wt} X + e\] where \(\beta_{wt}\) is the slope, \(\beta_0\) is the intercept, and \(e\) is the error term.

Ordinary Least Squares

With ordinary least squares (OLS) regression our goal is to minimize the residual sum of squares (RSS):

\[RSS=\sum^{n}_{i=1} \left( y_{i}-f(x_{i})\right)^{2}\]

where \(y_i\) is the variable to be predicted, \(f(x_i)\) is the predicted value of \(y_i\), and \(n\) is the sample size. Here is what we know about regression: the slope is \(m = r \frac{s_y}{s_x}\), where \(r\) is the correlation and \(s_y\) and \(s_x\) are the standard deviations of \(y\) and \(x\), and the line passes through the means, so the intercept is \(b = \bar{y} - m\bar{x}\).

We can easily calculate the RSS for various correlations (\(r\)) ranging between -1 and 1. Figure 2 visualizes the RSS.

y <- mtcars$mpg
x <- mtcars$wt
mean.y <- mean(y)
mean.x <- mean(x)
sd.y <- sd(y)
sd.x <- sd(x)
ols <- tibble(
    r = seq(-1, 1, by = 0.025),            # Correlation
    m = r * (sd.y / sd.x),                 # Slope
    b = mean.y - m * mean.x                # Intercept
) %>% rowwise() %>%
    mutate(ss = sum((y - (m * x + b))^2)) %>% # Sum of squares residuals
    as.data.frame()
datatable(ols)
Figure 2. Residual sum of squares.

The correlation that resulted in the smallest RSS is -0.875.

ols %>% filter(ss == min(ss)) # Select the row with the smallest sum of squares residuals
##        r         m       b       ss
## 1 -0.875 -5.389687 37.4306 278.3826

Calculating the correlation in R gives us -0.8676594 and the slope is -5.3444716, which are close to our estimates here. We could get a more accurate result by trying smaller steps in the correlation (see the by parameter in the seq function above).
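To illustrate, re-running the grid with a finer step (a sketch using base R so the chunk is self-contained; by = 0.001 instead of 0.025) recovers the correlation to three decimal places:

```r
# Finer grid search over the correlation (by = 0.001 instead of 0.025)
y <- mtcars$mpg
x <- mtcars$wt
r <- seq(-1, 1, by = 0.001)            # Candidate correlations
m <- r * sd(y) / sd(x)                 # Slope implied by each correlation
b <- mean(y) - m * mean(x)             # Intercept through the means
ss <- sapply(seq_along(r), function(i) sum((y - (m[i] * x + b[i]))^2))
r[which.min(ss)]  # -0.868, much closer to cor(x, y) = -0.8676594
```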

Minimizing RSS Algorithmically

This approach works well here because the correlation is bounded between -1 and 1 and we can easily calculate the RSS for a bunch of possible correlations. However, there are more efficient ways of finding the correlation that minimizes the RSS than trying correlations equally distributed across the range. For example, consider the following algorithm:

  1. Calculate the RSS for \(r = 0\).
  2. Calculate the RSS for \(r = 0.5\). If \(RSS_{0.5} < RSS_{0}\), then calculate the RSS with \(r = 0.75\); otherwise calculate the RSS with \(r = -0.5\).
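These steps can be sketched as a simple interval-shrinking search in R. This is illustrative only (a ternary-style search over \(r\), not the algorithm optim uses):

```r
# Interval search over the correlation: repeatedly shrink the bracket
# toward the side with the smaller RSS (illustrative; not what optim does)
y <- mtcars$mpg
x <- mtcars$wt
rss <- function(r) {
    m <- r * sd(y) / sd(x)             # Slope implied by the correlation
    b <- mean(y) - m * mean(x)         # Intercept through the means
    sum((y - (m * x + b))^2)
}
lo <- -1; hi <- 1
for (i in 1:30) {
    r1 <- lo + (hi - lo) / 3
    r2 <- hi - (hi - lo) / 3
    if (rss(r1) < rss(r2)) hi <- r2 else lo <- r1
}
(lo + hi) / 2  # Converges toward cor(x, y)
```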

We can repeat this procedure, essentially halving the distance in each iteration, until we find a sufficiently small RSS. This process is, in essence, the idea behind numerical optimization procedures. In R, the optim function implements the Nelder-Mead method for optimizing a set of parameters. The details of how the algorithm works are beyond the scope of this article (see this interactive tutorial by Ben Frederickson for a good introduction); instead we will focus on what the algorithm does. To begin, we must define a function that calculates the metric the optimizer is going to minimize (or maximize). Let’s start with RSS:

sumsquares <- function(parameters, predictor, outcome) {
    a <- parameters[1] # Intercept
    b <- parameters[2] # beta coefficient
    predicted <- a + b * predictor
    residuals <- outcome - predicted
    ss <- sum(residuals^2)
    return(ss)
}

The parameters argument is a vector of the parameters the algorithm is going to minimize (or maximize) over. Here, these are the intercept and slope. The predictor and outcome arguments are passed through from the ... parameter of the optim function and are necessary for us to calculate the RSS. We can now get the RSS for any set of parameters.

sumsquares(c(37, -5), mtcars$wt, mtcars$mpg)
## [1] 303.5247

Small Digression In order to explore each step of the algorithm, we need to wrap the optim function to capture the parameters and output at each call. The optim_save function below wraps optim and adds two elements to the returned list: iterations is the raw list of the saved parameters and output, and iterations_df is a data.frame version. Wrapping a function like this is unusual practice (please don’t email me) but is convenient for my purposes here.

optim_save <- function(par, fn, ...) {
    iterations <- list()
    wrap_fun <- function(parameters, ...) {
        n <- length(iterations)
        result <- fn(parameters, ...)
        iterations[[n + 1]] <<- c(parameters, result)
        return(result)
    }
    optim_out <- stats::optim(par, wrap_fun, ...)
    optim_out$iterations <- iterations
    optim_out$iterations_df <- as.data.frame(do.call(rbind, iterations))
    names(optim_out$iterations_df) <- c(paste0('Param', 1:length(par)), 'Result')
    optim_out$iterations_df$Iteration <- 1:nrow(optim_out$iterations_df)
    return(optim_out)
}

We can now call the optim_save function with our sumsquares function. We initialize the algorithm with two random values for the intercept and slope, respectively. Note that we are using the Broyden, Fletcher, Goldfarb, and Shanno (L-BFGS-B) optimization method, which allows for the specification of bounds on the parameter estimates; we will use this feature later.

optim.rss <- optim_save(
    par = runif(2),
    fn = sumsquares, 
    method = "L-BFGS-B",
    predictor = mtcars$wt,
    outcome = mtcars$mpg
)

The par parameter provides the final parameter estimates.

optim.rss$par
## [1] 37.285116 -5.344469

We can see that the parameters match the OLS estimates from the lm function to at least four decimal places.

lm.out <- lm(mpg ~ wt, data = mtcars)
lm.out$coefficients
## (Intercept)          wt 
##   37.285126   -5.344472

It took the optim function 65 iterations to find the optimal set of parameters that minimized the RSS. Figure 3 shows the value of the parameters (i.e. intercept and slope) and the RSS for each iteration.

df <- optim.rss$iterations_df
names(df) <- c('Intercept', 'Slope', 'SumSquares', 'Iteration')
df %>% melt(id.var = 'Iteration') %>%
    ggplot(aes(x = Iteration, y = value, color = variable)) +
    geom_point(size = 1) + geom_path() +
    facet_wrap(~ variable, scales = "free_y", ncol = 1) +
    xlab('Iteration') + ylab('') + theme(legend.position = 'none')
Figure 3. Output of the optimization procedure at each iteration.

Likelihood

Now that we have laid the groundwork for finding parameters algorithmically, we need to introduce another way of evaluating how well parameters fit the data, namely the likelihood. First, let’s revisit what we are doing in OLS. Figure 4 is a scatter plot of our observations, the OLS regression line in blue, and one observation highlighted in red with the residual as a red line. With OLS, we square the residual for every observation, thereby making all values positive, and summing them. There is, however, another way of estimating fit that doesn’t rely on the residuals.

pt <- 1 # Which observation do we want to explore
mtcars$fitted_mpg <- fitted(lm.out)
a <- lm.out$coefficients[1]
b <- lm.out$coefficients[2]
sigma <- summary(lm.out)$sigma
fitted.pt <- a + b * mtcars[pt,]$wt
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_segment(data = mtcars[pt,], color = 'red', size = 1,
                 aes(x = wt, xend = wt, y = mpg, yend = fitted_mpg)) +
    geom_point(data = mtcars[pt,], color = 'red', size = 4) +
    geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)
Figure 4. Scatter plot with residuals for one observation.

We often think of probabilities as the areas under a fixed distribution. For example, the first car in mtcars is the Mazda RX4, which averages 21 miles per gallon and weighs 2,620 lbs. The probability of a car having miles per gallon less than the Mazda RX4, given the data we have in mtcars, is 0.5599667 and is depicted in Figure 5.
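This probability can be computed directly with pnorm, the cumulative distribution function of the normal distribution:

```r
# Probability of a car getting less than 21 mpg, assuming mpg is normally
# distributed with the sample mean and standard deviation
pnorm(21, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg))
## [1] 0.5599667
```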

ggplot() +
    stat_function(fun = dnorm, n = 101, geom = "line",
                  args = list(mean = mean(mtcars$mpg),
                              sd = sd(mtcars$mpg))) +
    stat_function(fun = dnorm, n = 101, geom = "area", fill = "steelblue",
                  args = list(mean = mean(mtcars$mpg),
                            sd = sd(mtcars$mpg)),
                  xlim = c(mean(mtcars$mpg) - 3 * sd(mtcars$mpg), mtcars[pt,]$mpg)) +
    geom_segment(aes(x = mtcars[pt,]$mpg, xend = mtcars[pt,]$mpg),
                 y = 0, yend = dnorm(y[pt], mean(mtcars$mpg), sd(mtcars$mpg))) +
    xlim(mean(mtcars$mpg) - 3 * sd(mtcars$mpg), mean(mtcars$mpg) + 3 * sd(mtcars$mpg)) +
    xlab('Miles Per Gallon') + ylab('Density')
Figure 5. Probability distribution of miles per gallon.

For probabilities, we are working with a fixed distribution, that is:

\[pr(data\ |\ distribution)\]

The likelihood is the y-axis value (i.e. density) for a fixed data point under a distribution that is allowed to move, that is:

\[L(distribution\ |\ data)\]

The likelihood is the height of the density function. Figure 6 depicts the likelihoods for two observations. The mean of each distribution is the fitted value \(\hat{y} = \beta_0 + \beta_{wt} X\), and the error term \(\sigma\) (the residual standard error) defines the standard deviation of the distribution.
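For example, the likelihood of the first observation (the Mazda RX4) under the fitted OLS model is the normal density evaluated at its observed mpg, with the distribution centered on its fitted value:

```r
# Likelihood of one observation: the density at the observed mpg, under a
# normal centered at the fitted value with sd equal to the residual
# standard error
lm.out <- lm(mpg ~ wt, data = mtcars)
mu <- fitted(lm.out)[1]           # Fitted mpg for the Mazda RX4
sigma <- summary(lm.out)$sigma    # Residual standard error
dnorm(mtcars$mpg[1], mean = mu, sd = sigma)
```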

pt1 <- 1
p1 <- ggplot() +
    stat_function(fun = dnorm, n = 101,
                  args = list(mean = a + b * mtcars[pt1,]$wt,
                              sd = sigma)) +
    geom_segment(aes(x = mtcars[pt1,]$mpg, xend = mtcars[pt1,]$mpg),
                     y = 0, yend = dnorm(y[pt1], a + b * x[pt1], sigma)) +
    geom_point(aes(x = mtcars[pt1,]$mpg, y = dnorm(y[pt1], a + b * x[pt1], sigma)),
               color = 'red', size = 4) +
    xlim(mean(y) - 3 * sd(y), mean(y) + 3 * sd(y)) +
    xlab('') + ylab('Density')
pt2 <- 5
p2 <- ggplot() +
    stat_function(fun = dnorm, n = 101,
                  args = list(mean = a + b * mtcars[pt2,]$wt,
                              sd = sigma)) +
    geom_segment(aes(x = mtcars[pt2,]$mpg, xend = mtcars[pt2,]$mpg),
                     y = 0, yend = dnorm(y[pt2], a + b * x[pt2], sigma)) +
    geom_point(aes(x = mtcars[pt2,]$mpg, y = dnorm(y[pt2], a + b * x[pt2], sigma)),
               color = 'red', size = 4) +
    xlim(mean(y) - 3 * sd(y), mean(y) + 3 * sd(y)) +
    # xlim((a + b * x[pt2]) - 3 * sigma, (a + b * x[pt2]) + 3 * sigma) +
    xlab('Miles per Gallon') + ylab('Density')
plot_grid(p1, p2, ncol = 1)
Figure 6. Likelihood of a car having the observed mpg given the model parameters for two observations.

We can then calculate the likelihood for each observation in our data. Unlike OLS, where we minimized the RSS, we now want to maximize the likelihood across all observations, which means maximizing their product. Because the product of many small probabilities is numerically unstable, we work with the log of the likelihood so we can sum the values instead of multiplying them. We can now define our log-likelihood function:

loglikelihood <- function(parameters, predictor, outcome) {
    a <- parameters[1]     # intercept
    b <- parameters[2]     # slope / beta coefficient
    sigma <- parameters[3] # error
    ll.vec <- dnorm(outcome, a + b * predictor, sigma, log = TRUE)
    return(sum(ll.vec))
}

Note that we have to estimate a third parameter, sigma, the error term, which defines the standard deviation of the normal distribution used to calculate the likelihood. We can now calculate the log-likelihood for any combination of parameters.

loglikelihood(c(37, -5, sd(mtcars$mpg)),
              predictor = mtcars$wt,
              outcome = mtcars$mpg)
## [1] -91.06374
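As a sanity check, for the normal model the log-likelihood has a closed form: a constant plus a term proportional to the RSS. This is why maximizing the likelihood recovers the same slope and intercept as minimizing the RSS. A quick verification with the same parameter values:

```r
# log L = -n/2 * log(2 * pi * sigma^2) - RSS / (2 * sigma^2)
n <- nrow(mtcars)
sigma <- sd(mtcars$mpg)
rss <- sum((mtcars$mpg - (37 + -5 * mtcars$wt))^2)
-n / 2 * log(2 * pi * sigma^2) - rss / (2 * sigma^2)
## [1] -91.06374
```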

Maximum Likelihood Estimation

We can now use the optim_save function to find the parameters that maximize the log-likelihood. Note two important parameter changes:

  1. We are specifying the lower parameter so that the algorithm will not try negative values for sigma, since the standard deviation cannot be negative.
  2. The value for the control parameter indicates that we wish to maximize the values instead of minimizing them (the default).
optim.ll <- optim_save(
    runif(3),                     # Random initial values
    loglikelihood,                # Log-likelihood function
    lower = c(-Inf, -Inf, 1.e-5), # The lower bounds for the values, note sigma (error), cannot be negative
    method = "L-BFGS-B",
    control = list(fnscale = -1), # Indicates that the maximum is desired rather than the minimum
    predictor = mtcars$wt,
    outcome = mtcars$mpg
)

We can get our results and compare them to the results of the lm function and find that they match to at least four decimal places.

optim.ll$par[1:2]
## [1] 37.285114 -5.344468
lm.out$coefficients
## (Intercept)          wt 
##   37.285126   -5.344472

Figure 7 shows the estimated regression line for each iteration of the optimization procedure (on the left; OLS regression line in blue, MLE regression line in black), with the estimated parameters and log-likelihood for all iterations on the right.

df <- optim.ll$iterations_df
names(df) <- c('Intercept', 'Slope', 'Sigma', 'LogLikelihood', 'Iteration')
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
    geom_abline(data = df, aes(intercept = Intercept, slope = Slope)) +
    geom_point(data = mtcars, aes(x = wt, y = mpg)) +
    transition_time(Iteration) +
    labs(title = "Iteration: {frame_time}") +
    shadow_wake(wake_length = 0.1, alpha = FALSE)
p1_gif <- animate(p1, width = 480, height = 480)

df.melt <- df %>% melt(id.var = 'Iteration')
p2 <- ggplot(df.melt, aes(x = Iteration, y = value, color = variable)) +
    geom_vline(data = data.frame(Iteration2 = df$Iteration),
               aes(xintercept = Iteration2, frame = Iteration2)) +
    geom_path() +
    facet_wrap(~ variable, scales = "free_y", ncol = 1) +
    xlab('Iteration') + ylab('Parameter Estimate') +
    transition_time(Iteration2)
p2_gif <- animate(p2, width = 480, height = 480)

new_gif <- image_append(c(p1_gif[1], p2_gif[1]))
for(i in 2:100){
    combined <- image_append(c(p1_gif[i], p2_gif[i]))
    new_gif <- c(new_gif, combined)
}
new_gif
Figure 7. Animation of parameter estimates for each iteration of the optimization procedure.

Figure 8 superimposes the normal distribution from which the log-likelihood is determined. The distribution is centered on \(\hat{y}\). The height of the distribution (i.e. density) at \(y\) is the likelihood. We take the log of this value to get the log-likelihood. These log-likelihoods are calculated for each observation and summed. Maximum likelihood estimation attempts to find the parameters (i.e. slope and intercept) that maximize this sum.

intercept <- optim.ll$par[1]
slope <- optim.ll$par[2]
sigma <- optim.ll$par[3]
pt <- 2

heightFactor <- 1 # Increase the multiplier to make the distribution higher
k <- 4 # How many standard deviations to plot
x <- seq(-k * sigma, k * sigma, length.out = 50)
y <- dnorm(x, 0, sigma) / dnorm(0, 0, sigma) * heightFactor
x0 <- mtcars[pt,]$wt
y0 <- slope * x0 + intercept
path <- data.frame(x = y + x0, y = x + y0)
segment <- data.frame(x = x0, y = y0 - k*sigma, xend = x0, yend = y0 + k*sigma)
segment2 <- data.frame(x = x0, 
                       y = mtcars[pt,]$mpg, 
                       xend = dnorm(abs(y0 - mtcars[pt,]$mpg), 0, sigma) / dnorm(0, 0, sigma) * heightFactor + x0, 
                       yend = mtcars[pt,]$mpg)

ggplot(mtcars, aes(x = wt, y = mpg)) +
    # geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
    geom_abline(intercept = intercept, slope = slope) +
    geom_point(data = mtcars, aes(x = wt, y = mpg)) +
    geom_segment(data = segment, aes(x = x, y = y, xend = xend, yend = yend)) +
    geom_segment(data = segment2, aes(x = x, y = y, xend = xend, yend = yend), color = 'red') +
    geom_point(data = mtcars[pt,], aes(x = wt, y = mpg), color = 'red', size = 2) +
    # geom_point(data = segment2, aes(x = xend, y = y), color = 'red', size = 2) +
    geom_vline(xintercept = x0) +
    # geom_hline(yintercept = y0) + # Verify the distribution is centered on y-hat
    geom_path(data = path, aes(x,y), color = "blue")
Figure 8. Likelihood for one observation superimposed on the scatter plot.

Figure 9 depicts the likelihoods for the first 16 observations.

tmp <- df %>% filter(Iteration == nrow(df))
plots <- list()
nplots <- 16 #nrow(mtcars)
for(i in 1:min(nplots, nrow(mtcars))) {
    a <- tmp[1,]$Intercept
    b <- tmp[1,]$Slope
    sigma <- tmp[1,]$Sigma
    predictor <- mtcars$wt[i]
    predicted.out <- a + b * predictor
    outcome <- mtcars$mpg[i]
    d <- dnorm(outcome, predicted.out, sigma)
    plots[[i]] <- ggplot() +
        stat_function(fun = dnorm,
                      n = 101,
                      args = list(mean = predicted.out, sd = sigma)) +
        annotate(geom = 'segment', x = outcome, y = 0, xend = outcome, yend = d, color = 'red') +
        annotate(geom = 'point', x = outcome, y = d, color = 'red', size = 2) +
        xlim(c(min(mtcars$mpg, predicted.out - 3 * sigma),
               max(mtcars$mpg, predicted.out + 3 * sigma))) +
        ylim(c(0, .2)) +
        ylab('') + xlab(row.names(mtcars)[i])
}
plot_grid(plotlist = plots)
Figure 9. Likelihoods of the first 16 observations for the final parameter estimates.

Generalized Linear Models

Generalized linear models (GLM) are a generalization of OLS that allows for the response variables (i.e. dependent variables) to have an error distribution that is not normally distributed. All generalized linear models have the following three characteristics:

  1. A probability distribution describing the outcome variable
  2. A linear model
    \(\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n\)
  3. A link function that relates the linear model to the parameter of the outcome distribution
    \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)

We can estimate GLMs using MLE as described above. What will change is the log-likelihood function.

Logistic Regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors. We assume a binomial distribution produced the outcome variable, and we therefore want to model \(p\), the probability of success for a given set of predictors. Instead of fitting a line (or a plane for two predictors, and so on for higher dimensions), we wish to fit the data to the logistic function, which is defined as:

\[ \sigma \left( t \right) =\frac { { e }^{ t } }{ { e }^{ t }+1 } =\frac { 1 }{ 1+{ e }^{ -t } } \]

logistic <- function(t) { 
    return(1 / (1 + exp(-t))) 
}
ggplot() +
    stat_function(fun = logistic, n = 101) +
    xlim(-4, 4) + xlab('x')
Figure 10. Logistic curve

To finish specifying the Logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options but the most commonly used is the logit function which is specified as:

\[logit(p) = \log\left(\frac{p}{1-p}\right),\text{ for $0\le p \le 1$}\]
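The logit maps probabilities to the whole real line, and its inverse, the logistic function, maps back; a quick check (the function names here are my own, not ones used later in this document):

```r
# The logit link and its inverse (the logistic function)
logit_link <- function(p) log(p / (1 - p))
inv_logit  <- function(eta) 1 / (1 + exp(-eta))
inv_logit(logit_link(0.25))  # Recovers 0.25
```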

We can specify t as a linear combination of our predictors (independent variables).

\[ t = \beta_0 + \beta_1 x \]

The logistic function can now be rewritten as:

\[ F\left( x \right) =\frac { 1 }{ 1+{ e }^{ -\left( { \beta }_{ 0 }+\beta _{ 1 }x \right) } } \]

Consider the following data set where we wish to predict whether a student will pass an exam based upon the number of hours they studied.

study <- data.frame(
    Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,
            3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50),
    Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1)
)
ggplot(study, aes(x = factor(Pass), y = Hours)) + geom_boxplot() + xlab('Pass') + ylab('Hours Studied')
Figure 11. Boxplot of hours studied by passing.

First, we need to define the inverse logit (i.e. the logistic function) and the log-likelihood function that will be used by the optim function. Instead of using the normal distribution as above (via the dnorm function), we use the binomial distribution, with the logistic function linking the linear combination of predictors to the probability of success.

logit <- function(x, beta0, beta1) {
    # Note: despite the name, this is the inverse logit (i.e. the logistic
    # function), mapping the linear predictor to a probability
    return( 1 / (1 + exp(-beta0 - beta1 * x)) )
}
loglikelihood.binomial <- function(parameters, predictor, outcome) {
    a <- parameters[1] # Intercept
    b <- parameters[2] # beta coefficient
    p <- logit(predictor, a, b)
    ll <- sum( outcome * log(p) + (1 - outcome) * log(1 - p))
    return(ll)
}

Now we can call the optim function and get the final parameter estimates.

optim.binomial <- optim_save(
    c(0, 1), # Initial values
    loglikelihood.binomial,
    method = "L-BFGS-B",
    control = list(fnscale = -1),
    predictor = study$Hours,
    outcome = study$Pass
)

In R, the glm (short for generalized linear models) function implements logistic regression when the family = binomial(link = 'logit') parameter is set. See ?glm for other families used to estimate models with other underlying distributions. We can see that our estimates match the results of glm to within rounding error.

optim.binomial$par
## [1] -4.077575  1.504624
lr.out <- glm(Pass ~ Hours, data = study, family = binomial(link = 'logit'))
lr.out$coefficients
## (Intercept)       Hours 
##   -4.077713    1.504645
# Redefine the logistic function to include parameter estimates
logistic <- function(x, beta0, beta1) {
    return(1 / (1 + exp(-1 * (beta0 + beta1 * x)) ))
}

beta0 <- optim.binomial$par[1]
beta1 <- optim.binomial$par[2]

ggplot(study, aes(x = Hours, y = Pass)) +
    # geom_smooth(method = 'glm', formula = y ~ x, se = FALSE,
    #           method.args = list(family = binomial(link = 'logit'))) +
    geom_point(aes(color = logistic(Hours, beta0, beta1) > 0.5)) +
    stat_function(fun = logistic, n = 101, 
                  args = list(beta0 = beta0, beta1 = beta1) ) +
    scale_color_hue('Predicted Pass > 0.5') +
    theme(legend.position = c(0.85, 0.15))
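With the fitted model in hand, we can predict the probability of passing for a new student, for example one who studied 2 hours (a sketch; the study data frame is redefined here only so the chunk is self-contained):

```r
# Predicted probability of passing for a student who studied 2 hours
study <- data.frame(
    Hours = c(0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50),
    Pass  = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
)
lr.out <- glm(Pass ~ Hours, data = study, family = binomial(link = 'logit'))
predict(lr.out, newdata = data.frame(Hours = 2), type = 'response')  # About 0.26
```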

Let’s explore the process of the numeric optimizer. For this model, it took 70 iterations to converge to the resulting parameters.

df <- optim.binomial$iterations_df
names(df) <- c('Intercept', 'Hours', 'LogLikelihood', 'Iteration')
xlim <- c(0, 6) # Hard coding for now
df2 <- data.frame(Iteration = rep(1:nrow(df), each = 100))
xvals <- seq(xlim[1], xlim[2], length.out = 100)
tmp <- apply(
    df, 1, FUN = function(x) {
        logistic(xvals, x[1], x[2])
    }
) %>% as.data.frame()
names(tmp) <- 1:ncol(tmp)
tmp <- melt(tmp)
names(tmp) <- c('Iteration', 'Pass')
tmp$Hours <- rep(xvals, nrow(df))

nFrames <- nrow(df) * 2
p1 <- ggplot() + 
    geom_smooth(data = study, aes(x = Hours, y = Pass),
        method = 'glm', formula = y ~ x, se = FALSE, alpha = 0.5,
        method.args = list(family = binomial(link = 'logit'))) +
    geom_point(data = study, aes(x = Hours, y = Pass)) + 
    geom_path(data = tmp, aes(x = Hours, y = Pass, group = Iteration)) +
    transition_states(Iteration) +
    labs(title = "Iteration: {round(frame/2)}") +
    shadow_wake(wake_length = 0.1, alpha = FALSE) +
    ease_aes("cubic-in")
p1_gif <- animate(p1, nframes = nFrames, width = 480, height = 480)

df.melt <- df %>% melt(id.var = 'Iteration')
p2 <- ggplot(df.melt, aes(x = Iteration, y = value, color = variable)) +
    geom_vline(data = data.frame(Iteration2 = df$Iteration),
               aes(xintercept = Iteration2, frame = Iteration2)) +
    geom_path() +
    facet_wrap(~ variable, scales = "free_y", ncol = 1) +
    xlab('Iteration') + ylab('Parameter Estimate') +
    theme(legend.position = 'none') +
    transition_time(Iteration2)
p2_gif <- animate(p2, nframes = nFrames, width = 480, height = 480)

new_gif <- image_append(c(p1_gif[1], p2_gif[1]))
for(i in 2:nFrames){
    combined <- image_append(c(p1_gif[i], p2_gif[i]))
    new_gif <- c(new_gif, combined)
}
new_gif

pt <- 1
beta0 <- optim.binomial$par[1]
beta1 <- optim.binomial$par[2]

df2 <- data.frame(x = xvals,
                  p = logistic(xvals, beta0, beta1))

ggplot(df2, aes(x = x, y = p)) + 
    geom_path() + 
    geom_segment(aes(x = study[pt,]$Hours, xend = study[pt,]$Hours,
                     y = study[pt,]$Pass, yend = logit(study[pt,]$Hours, beta0, beta1)),
                 color = 'red', size = 1) +
    geom_point(aes(x = study[pt,]$Hours, y = logit(study[pt,]$Hours, beta0, beta1)),
               color = 'red', size = 4)

study$p <- logit(study$Hours, beta0, beta1)
ggplot(df2, aes(x = x, y = p)) + 
    geom_path() + 
    geom_segment(data = study, aes(x = Hours, xend = Hours,
                                   y = Pass, yend = p), color = 'red', size = 1) +
    geom_point(data = study, aes(x = Hours, y = p), color = 'red', size = 4)